From WordNet to CELEX: acquiring morphological links from dictionaries of synonyms
Abstract
Morphological resources such as CELEX do not exist for many languages. NLP and IR systems that operate on texts and documents written in these languages therefore have to rely on morphological resources acquired from lexica or corpora. These resources usually suffer from a precision problem because no a priori semantic knowledge is used in their acquisition. The paper proposes a robust and language-independent technique to acquire morphological constructional relations from dictionaries of synonyms. The idea is to explore synonymy and morphological relations simultaneously in order to make more accurate predictions. The paper presents an evaluation of the technique and a comparison of the acquired morphological links with the CELEX database.

1. Constructional morphology for NLP and IR

In the last decade, interest in morphology, and especially constructional [1] morphology, has been growing in theoretical linguistics, in computational linguistics and in related domains such as information retrieval (IR). Some recent experiments in IR (Xu and Croft, 1998; Jacquemin and Tzoukermann, 1999) have shown that constructional morphology can help improve the efficiency of IR systems. For highly inflected languages such as French, a proper treatment of inflexional morphology is imperative (Namer, 2000). This is less the case for poorly inflected languages (Krovetz, 1993).

Word formation is commonly regarded as lexical. For instance, Bybee (1988; 1995) develops a theory where the lexicon is viewed as a network of lexical items (e.g. fully inflected forms) connected to each other by relations set up according to shared semantic and phonological features. From a computational point of view, word formation can be dealt with in two ways. The first is by means of a morphological analyzer such as Englex (Antworth, 1990) for English (based on the two-level model) or DeriF (Dal et al., 1999; Namer and Dal, 2000) for French (based on the SILEX model).
This solution has many limitations: it is expensive; a lot of linguistic knowledge has to be implemented in the morphological analyzers, which implies a long and close collaboration between linguists and programmers; morphological analyzers cannot be easily adapted to other languages; they strongly depend on their underlying linguistic models...

The second way is by means of morphological databases such as CELEX (Baayen et al., 1995). This solution can only be used for a very small number of languages: Dutch, English and German. For instance, as far as we know, no such database is available for Romance languages. One reason for this lack of morphological databases is that their creation is quite expensive.

[1] We adopt the terminology proposed by Danièle Corbin and her team (Corbin, 2001); we prefer the term “constructional” to “derivational”, which does not always imply a single notion.

However, several methods of supervised and unsupervised acquisition of constructional morphology have been proposed. All of them involve some amount of symbolic or statistical learning. The input may be lexical data: inflected forms as in (Gaussier, 1999) and (Hathout, 2000), or medical nomenclature as in (Grabar and Zweigenbaum, 1999). More often, morphological knowledge is acquired from text corpora, as in (Jacquemin, 1997), (Goldsmith, 2001), (Schone and Jurafsky, 2000; Schone and Jurafsky, 2001) or (Déjean, 1998).

The learning of constructional morphology relies on a double approximation:
1. Word forms are a good approximation of the phonological features.
2. Word forms can be used as an approximation of word meaning: word forms that share a long enough substring are associated with lexemes that have a good chance of being semantically related.

Corpus-based methods can be very helpful for specific NLP and IR tasks. In particular, they can adapt to the vocabulary of the texts.
However, their results cannot be easily accumulated into database repositories because most of them do not have sufficient precision. The problem of precision is common to all methods and tools that do not use a priori linguistic knowledge. While the first approximation above is quite satisfactory, the second one is very coarse and cannot be improved without integrating a minimal amount of semantic knowledge into the process. Semantics can be either included in the tools or described in an external resource. The latter option is superior to the former because it preserves the generality of the method and guarantees its independence from individual languages.

2. Combining word formation and synonymy into analogies

Almost all unsupervised methods that acquire constructional morphology from corpora or lexicons proceed in two steps:
1. they connect word forms by stripping and adding graphemic affixes;
2. the connections are then filtered on the basis of various parameters such as the frequency of the stripping/adding patterns, the number of characters stripped/added, the co-occurrence of the words in some segments of text (e.g. fixed-size windows), the similarity of the words' contexts (measured by means of TF-IDF weighting), etc.

The weakness of these methods (especially regarding precision) comes from the nature of the information used to decide whether the words can be connected or not: either it is statistical or it relies on a priori approximation. This problem may be solved by the use of resources that contain some semantic knowledge and which have been built or checked by humans. These resources include lexical databases like WordNet and machine-readable dictionaries (MRDs). Among these, dictionaries of synonyms are perfectly suited for semantic filtering for at least three reasons:
1. they have a uniform and standard format;
2. most of their information is encoded explicitly (it is made up of binary synonymy relations between entries);
3. synonymy relations are almost exactly the kind of semantic knowledge we are looking for (they hold precisely between words that share semantic features).

Dictionaries of synonyms have additional desirable features [2]: they exist (at least in printed form) for many languages; their format does not depend on individual languages; they often have a quite small size (they can be made machine readable at a reasonable cost).

However, synonymy relations usually hold between words that belong to different constructional families, while word formation connects members of the same family. Synonymy relations can be viewed as orthogonal to the constructional ones. Nevertheless, they can be easily exploited because the relation “share semantic features with” is transitive. More specifically, we aim at filtering the morphological links predicted on the basis of the sharing of a common graphemic substring. For instance, in figure 1, abandon/V and abandonment/N are connected because they share the ...

[2] By an abuse of language, we will use the term synonymy for relations that might be better termed semantic proximity.

[Figure 1: abandon/V, abandonment/N, desert/V]
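
The two-step procedure and the synonymy-based filtering described in section 2 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the `min_stem` threshold, the toy word list and the toy synonym dictionary are all hypothetical, part-of-speech tags are omitted, and the filtering criterion (keep a link only if a synonym of one word is itself linked to a synonym of the other) is one plausible reading of the analogy idea, which this extract does not spell out.

```python
from itertools import combinations


def common_prefix_len(a, b):
    """Length of the longest common initial substring of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def candidate_links(words, min_stem=3):
    """Step 1: connect word forms that share a long enough initial substring.

    Each link is (a, b, (suffix_of_a, suffix_of_b)), i.e. the graphemic
    material that must be stripped and added to go from a to b.
    """
    links = []
    for a, b in combinations(sorted(words), 2):
        k = common_prefix_len(a, b)
        if k >= min_stem:
            links.append((a, b, (a[k:], b[k:])))
    return links


def filter_by_synonymy(links, synonyms):
    """Step 2 (assumed criterion): keep a link (a, b) only if some synonym
    of a is itself linked to some synonym of b."""
    linked = {frozenset((a, b)) for a, b, _ in links}
    kept = []
    for a, b, pattern in links:
        supported = any(
            frozenset((c, d)) in linked
            for c in synonyms.get(a, ())
            for d in synonyms.get(b, ())
            if c != d
        )
        if supported:
            kept.append((a, b, pattern))
    return kept


# Toy data, invented for illustration only.
WORDS = ["abandon", "abandonment", "desert", "desertion", "dessert"]
SYNONYMS = {
    "abandon": {"desert"},
    "desert": {"abandon"},
    "abandonment": {"desertion"},
    "desertion": {"abandonment"},
}

links = candidate_links(WORDS, min_stem=3)
kept = filter_by_synonymy(links, SYNONYMS)
for a, b, pattern in kept:
    print(a, b, pattern)
```

On this toy data, the purely graphemic step also links desert and dessert, which share the substring "des"; the synonymy filter drops that link because dessert has no synonym that is linked to a synonym of desert, while abandon/abandonment and desert/desertion support each other.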
Similar resources
Development of Myanmar-English Bilingual WordNet like Lexicon
A bilingual concept lexicon is of significance for Information Extraction (IE), Machine Translation (MT), Word Sense Disambiguation (WSD) and the like. Myanmar-English Bilingual WordNet like Lexicon (MEBWL) is developed to fulfill the requirements of Language Acquisition (LA). However, building such a lexicon is quite challenging and consuming in time and cost. To overcom...
Automatic Construction of Persian ICT WordNet using Princeton WordNet
WordNet is a large lexical database of English language, in which, nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets). Each synset expresses a distinct concept. Synsets are interlinked by both semantic and lexical relations. WordNet is essentially used for word sense disambiguation, information retrieval, and text translation. In this paper, we propose s...
NANYANG TECHNOLOGICAL UNIVERSITY SCHOOL OF HUMANITIES AND SOCIAL SCIENCES Creating derivational morphology links in Wordnet Bahasa
Derivational morphology links are created for the Wordnet Bahasa, a combined Indonesian and Malay online lexical dictionary (Nurril Hirfana, Suerya, & Bond, 2011). The focus was to link root words to affixed words as affixation is one of the more apparent word formation processes in Bahasa Melayu. MorphInd, an Indonesian morphological analyser (Larasati, Kubon, & Zeman, 2011), is used to breakd...
Processing and extracting data from an open dictionary of the Portuguese language
Synonyms dictionaries are useful resources for natural language processing. Unfortunately their availability in digital format is limited, as publishing companies do not release their dictionaries in open digital formats. Dicionário-Aberto (Simões and Farinha, 2010) is an open and free digital synonyms dictionary for the Portuguese language. It is under public domain and in textual digital form...
The Core of the Czech Derivational Dictionary
Amongst all available language resources for the Czech language one can find a lot of useful dictionaries, databases and corpora. There are machine readable dictionaries of literary Czech (Havránek, 1989; Filipec, 1998), the dictionary of Czech synonyms (Pala, 2000) and two encyclopaedia: Otto and Diderot. Moreover, Czech researchers have two morphological databases (Hajič, 2001; Sedláček and S...
Publication date: 2002